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H Abstract — We present new adaptive format for storing sparse matrices on 

^^ GPU. We compare it with several other formats including CUSPARSE which 

^ is today probably the best choice for processing of sparse matrices on GPU in 

\^ CUDA. Contrary to CUSPARSE which works with common CSR format, our 

fSJ new format requires conversion. However, multiplication of sparse-matrix 

and vector is significantly faster for many matrices. We demonstrate it on 
set of 1 600 matrices and we show for what types of matrices our format is 
profitable. 
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CPUs are specialized devices offering high computational power as well 
as high memory throughput [6l [11]. Their architecture is very similar to 
r^^ processor arrays [12]. CPUs are efficient for algorithms exhibiting massively 

fT^ parallel and homogeneous computations. Even though most operations from 

r^ linear algebra are not computationally intensive, corresponding algorithms 

'^ can profit from high memory throughput reaching almost 200 GB/s. This 

(^ has been demonstrated in many works dealing with dense matrices [5l [U [H] . 

"O Situation is more complicated in case of sparse matrices. They often have 

^^ irregular pattern of non-zero elements and they significantly reduce available 

. . parallelism. It makes processing of the sparse matrices on GPU difficult but 

^ also challenging. 

In this article we concentrate on the sparse-matrix vector multiplication 
(SpMV) since it is a key part of many iterative solvers for linear systems. 
The performance is infiuenced mainly by the format used to store the 
matrix. Formats for storing the sparse matrices often involves additional 
data, typically column indexes. On the vector architectures, the data must 
by aligned in the memory. CUDA developers speak about coalesced global 
memory accesses. Their importance for the efficiency of SpMV is discussed 
in j2l[in|. For this reason, artificial zeros must be often inserted. Moreover, 
with different number of non-zero elements in each row, the multiprocessors 
may be load unequally. Efficient format should address the following: 

(i) reduce amount of additional data while keep coalesced global memory 

accesses, 
(ii) distribute the non-zero elements of the matrix evenly. 

1.1. Contribution 

We present new format for storing the sparse-matrices on GPU. It is 
optimized for matrix-vector multiplication. With some types of matrices. 
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Figure 1. CSR format 



it is several times faster compared to today state-of-the-art formats like 
CUSPARSE, Hybrid (in CUSP library) [2j, Row-grouped CSR [lO] and 
Sliced ELLPACK [7] . We analyze on what type of matrices the new format 
outperforms the others and vice versa. 

1.2. Organization 

The article is organized as follows. In Section [2| we briefly summarize 
existing formats for storing sparse matrices on GPU. The new format is 
presented in Section [3] together with description of conversion from CSR 
format. The implementation in CUDA with a source code of matrix-vector 
multiplication kernel is in Section[4J Achieved results with short performance 
analysis are topics of the last Section [5| 



2. FORMATS FOR SPARSE MATRICES ON GPU 

CSR (Compressed Sparse Rows) format (Figure [l]) is standard and most 
popular format for storing sparse matrices |i3j. It consists of arrays values 
storing rowisely the non-zero elements of the matrix, columns storing column 
index of each non-zero element and rowPointers containing offset of each 
matrix row in arrays values and columns. This format is easy to implement. 
Storing data in arrays improves efficiency of data transfer. 

Works by Bell and Garland [2] or Bautois et al. [3] showed that this 
format is inefficient on GPU for matrix-vector multiplication. Better formats 
are based on the ELLPACK format (Figure ^ by Monakov and Avetisian [7] 
or our similar format studied in [TO]. Formats based on ELLPACK require 
homogeneous distribution of non-zero elements in rows. If the number of 
non-zero elements in each row is very different, the ELLPACK format loses 
efficiency. Bell and Garland [2] proposed Hybrid format which is part of 
CUSP library. It has also achieved great popularity. Recently CUSPARSE 
[S] showed that even pure CSR format can be implemented efficiently on 
GPU. Nevertheless, tests on large sets of sparse matrices like those in |10j 
show that there are a lot of matrices for which common CPU performs better. 
There is still great potential to improve formats for sparse matrices on GPU. 
The ELLPACK format, as depicted on the Figure [2j allocates the same 
number of elements for each matrix row. If there are less non-zero elements 
on some row, artificial zeros are added to align the data. This increases 
memory requirements and slow down matrix-vector multiplication because 
more data must be transferred. On GPU, usually one thread is mapped to 
one row. Arrays values and columns are stored in global memory and so the 
accesses to these arrays must coalesced (see [21 [10| ) . This is reason why these 
arrays are stored columnwise instead of rowise. Since the coalesced memory 
accesses must be fulfilled only for threads in one warp we may split the 
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Figure 2. ELLPACK format 

matrix into slices of rows processed by the same warp. The shces are stored 
separately. If there is a row having significantly more non-zero elements 
than the others, only one slice is affected. This modification, introduced 
independently in [TJ [TO], reduces number of artificial zeros required by 
original ELLPACK format. However, for some matrices these formats still 
generate too many artificial zeros and matrix-vector multiplication can be 
several orders slower compared to CSR format on CPU. 

The next step is to split long rows (i.e. with a lot of non-zero elements) 
and process them by more threads. The matrix is again divided into groups 
of rows. We introduce chunks of non-zero elements. Each chunk in the same 
group has the same size. One row can be splitted into more chunks (but 
one chunk cannot cross boundary of one row). Chunks are stored in the 
same way as matrix rows in the ELLPACK format (also stored columnwise 
to get the coalesced memory accesses). When performing the matrix- vector 
multiplication, we map one thread to each chunk. First the chunks are 
processed which results to array of partial sums. If some row consists of 
more chunks, its partial sums must be summed to get the final result. In 
the following text, we explain this format in more details. 



3. NEW ADAPTIVE ROW-GROUPED CSR FORMAT 

Figure [3] shows an example of matrix for which the ELLPACK format is 
inefficient. All rows except the last one have only one non-zero element and 
the last row is full. Assume that we map 12 threads to this matrix. We first 
assign one thread to each row. The thread mapped to the last row must 
process 8 elements. Therefore the chunk size is 8 and all other threads must 
process 8 elements as well because they belong to the same warp. We would 
have to allocate 49 artificial zeros to align the data. Since we have 4 threads 
left we may use them to diminish the chunk size by mapping them to the 
last row. As depicted in the Figure [3j if there are 4 threads assigned to the 
last row, the chunk size is only 2. One thread remains free and we need 
to allocate only 7 artificial zeros. To ensure the coalesced global memory 
accesses we store the chunks in the same way as rows are stored in Sliced 
ELLPACK format [7j or Row-grouped CSR format [TO]. In other words, if 
chunks were rows of some matrix, the matrix would be stored columnwise. 
Large matrices are splitted into small groups of rows. Each group is 
defined by the following structure (Listing [I]) : 

Listing 1. Structure defining group of rows 



struct argcsrGroupInfo 
{ 



int firstRo\ 
int size : 
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Figure 3. Example demonstrating the Adaptive Row-grouped CSR format: 
a) the original matrix, b) mapping of chunks to matrix rows - each chunk 
is depicted in different greyscale, c) arrays with the non-zero elements 
(values), column indexes (columns) and mapping of threads to rows 
(threadsMapping). 
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Parameter f irstRow is the first matrix row and size is number of rows in 
the group i.e. the group contains rows indexed from f irstRow to f irstRow 
+ size - 1. Parameter offset points to the first element of the group in 
arrays values and columns and chunkSize describes number of elements 
in chunk. It is constant in each group but it can be different for different 
groups. The number of chunks is the same for each group and equals CUDA 
block size. For each group there are chunkSize * blockDim.x elements in 
arrays values and columns starting at position offset. 

Let us now comment a conversion from the CSR format. For good 
performance, it is necessary that all groups have approximately the same 
number of non-zero elements. Otherwise the load balance of multiprocessors 
would be unequal. One parameter, entering the converting algorithm 
is desiredChunkSize. We allocate new group, read the matrix row 
by row and compute number of non-zero elements. Once it is larger 
than desiredChunkSize * blockDim.x or there would be more rows than 
blockDim.x in the current group, we close it and allocate new one. The 
groups are defined by array of structures argcsrGroupInf o. In the next 
step of the conversion, we compute the chunk size in each group. This 
can be done in parallel. We start by mapping one thread to each row of 
the group. The chunk size would be now equal to the number of non-zero 
elements in row with greatest filling. We define the chunk filling as number of 
non-zero elements in given chunk. If there are some threads left, we always 
find row with greatest chunk filling and map one more thread to it. When 
all threads are distributed to the rows we can compute the final chunk size. 
We also need to store mapping of threads to rows. For this we allocate array 
globalThreadsMapping. It contains number of threads mapped to each row 
of matrix. Then we perform exclusive prefix-sum on this array separately 
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for each group. The last step is fihing the arrays values and columns by 
data belonging to particular chunks. Since we know the chunk size in each 
group we know how many elements will be inserted by each group. This 
phase thus can be done in parallel as well. We read the data chunkwise from 
CSR format and copy them in appropriate order to the mentioned arrays. 
In the next section we explain matrix- vector multiplication. 

4. IMPLEMENTATION IN CUDA 

For better understanding, we show the source code of the kernel in CUDA 
- see Listing [2] 

Listing 2. Kernel for matrix-vector multiplication 



1 

2 

3 

4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
43 
44 
45 
46 
47 
48 
49 
50 
51 



template< class Real > 
--global-- void spinvKcrnel( 

Real* target , 

Real* vect , 

Real* values , 

int * columns , 

argcsrGroupInfo* globalGroupInfo , 

int* globalThreadsMapping ) 



{ 



extern --shared-- int sdata[]; 
const int* globalGroupInfoPointer = 

reinterpret _cast < const int* >( globalGroupInfo ); 
argcsrGroupInfo* grouplnfo = 

reinterpret _cast < argcsrGroupInfo* >( &sdata [ ] ); 
int* threadsMapping = 

reinterpret _cast < int* >( &sdata [ 4 ] ); 
Real* partialSums = 

reinterpret -Cast < Real* >( &sdata [ 4 + blockDim . x] ); 

int bid = blockldx.x; 

/* * * * 

* Fetch the group info from the global memory 
*/ 

if( threadldx.x < 4 ) 

sdata[ threadldx.x ] = 

globalGroupInfoPointer [ 4 * bid + threadldx.x ]; 
__syncthreads (); 

/* * * * 

* Fetch mapping of threads to rows. 
*/ 

if( threadldx . x < grouplnfo — > size ) 
threadsMapping [ threadldx . x ] = 

globalThreadsMapping [ grouplnfo — > firstRow + threadldx . x 

/* * * * 

* Each thread computes partial sum in its chunk 
*/ 

Real sum = 0; 

int threadOffset = grouplnfo — > offset + threadldx. x; 

for ( int i = 0; i < grouplnfo — > chunkSize ; i ++ ) 

{ 

const int column = columns [ threadOffset ] ; 
if( column != —1 ) 

sum += values [ threadOffset ] * vect [ column ] ; 
else 

break ; 
threadOffset += blockDim. x; 

} 

partialSums [ threadldx. x ] = sum; 
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syncthreads ( ) ; 


53 








54 








55 




/**** 1 


56 




* 


Sum the partial sums in each row 


57 




* 


/ 


58 




if 


( thrcadldx . x < grouplnfo — > size ) 


59 




{ 




60 






sum = ; 


61 






int begin ( ) ; 


62 






const int row = grouplnfo — > firstRow + threadldx . x; 


63 






if ( threadldx . x > ) 


64 






begin = threadsMapping [ threadldx. x — 1]; 


65 






int end = threadsMapping [ threadldx . x ] ; 


66 






for( int i = begin; i < end; i++ ) 


67 






sum += partialSums [ i ] ; 


68 






target [ row ] = sum; 


69 




} 




70 


} 







The kernel first fetch data with reuse to fast shared memory. We 
allocate shared memory (lines 10-18) for one argcsrGroupInf o structure, 
array threadsMapping keeping track of thread indexes mapped to rows of 
the group and array partialSums meaning of which is explained later. Then 
we fetch the group info structure. To achieve coalesced memory access we 
employ four threads to this task (lines 25-28). Thread synchronization is 
important here. Next, we may fetch array with mapping of threads to rows. 
The next part is independent on this array and therefore synchronization is 
not necessary. Each thread takes its own chunk and perform multiplication 
of this part of matrix data with given part of input vector. Result of this is 
partial sum. We remind that we mark artificial zeros by column index -1. 
Once a thread reaches this column index it exits the loop on lines 42-50. If 
it happens in whole warp, it exits too and it omits the rest of artificial zeros. 
The last step is summing of the partial sums. There is data dependency with 
the previous part and so thread synchronization on the line 52 is necessary. 

5. RESULTS 

The results, we present in this section, were obtained on a system equipped 
with CPU AMD Phenom II X6 HOOT with 16 GB DDR3 RAM and GPU 
Nvidia Tesla C2070 having 6 GB GDDR5 with memory bandwidth 144 
GB/s (ECC was turned off). All tests were done in double precision and 
sequentially on CPU. Testing matrices were fetched from matrix databases 
[Ull]. The testing set contained almost 1 600 sparse square matrices. 

The best performance was achieved with 128 threads in block. Test 
show that the parameter desiredChunkSize can have strong impact on 
the efficiency. Simple rule might be: the more regular the matrix is (in 
sense that there are almost the same non-zero elements in each row), the 
larger the desired chunk size should be. With desired chunk size 32 we 
have achieved the highest performance almost 18 GFLOPS with matrices 
Schenk_AFE originating in structural problems. With desiredChunkSize 
set to one, the performance dropped to 11 GFLOPS for this matrix. On the 
other hand, with the matrix rajat23, the performance was six times higher 
with desiredChunkSize 1 (5.1 GFLOPS and speed-up 11 compared to CSR 
on CPU) than with desiredChunkSize 32 (0.81 GFLOPS). In the rest of 
this section we present results obtained with the desired chunk size set to 1 
because it seems to be more robust setting. 
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Figure |4] shows speed-up of tested formats compared to CSR format 
on CPU. The vertical axis shows the speed-up in logarithmic scale and the 
horizontal axis shows on how many matrices the formats attain given speed- 
up. There is one curve for each format and the slower it decreases the better 
the format is. Our test shows that the Hybrid format, Row-grouped CSR 
format, CUSPARSE library and the new format are is faster for 726, 907, 
994 and 1168 matrices of 1600 respectively. This figure also shows that there 
are few matrices for which CPU is two orders of magnitude faster. They are 
mainly small matrices having tens or hundreds of unknowns as our previous 
test in [lOj show. 

Figure [5] shows speed-up of the new format compared to the others. The 
vertical axis shows speed-up in logarithmic scale while the horizontal number 
of matrices for which the new format attains given speed-up compared to 
other formats. The higher the curve is, the better the new format is. Our test 
show that the the Hybrid format is slower on 1318 matrices, Row-grouped 
CSR on 1072 matrices and CUSPARSE on 1358 matrices. 

Both Figures|4]and[5]demonstrate that the performance of sparse-matrix 
and vector multiplication on GPU is very variable. If high performance is 
the top priority, one should test more formats and choose the best one. 
For example the CUSPARSE library is almost 4 times faster than the new 
format on TSOPF and case39 matrix sets from ^. Both types of matrices 
model "Transient stability-constrained optimal power flow" and usually the 
Hybrid format outperforms the others in these cases. On the other hand, 
the new format is ten times faster than CUSPARSE (and at the same time 
5-8 times faster than CSR on CPU) for matrices raj, raj at, GHS.indef, 
IBM_EDA. These matrices come from circuit simulations or optimizations 
problems. Original Row-grouped CSR format (or similar Sliced ELLPACK) 
is almost twice faster for norris/torso2 and t2d_q matrices originating in 
finite difference methods. The mentioned matrices are all difficult to visualize 
in this paper because of very large dimensions and we refer reader to [4J for 
more details. 
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